This course - Introduction to Open Data Science - makes use of the open research tools with open data and how to apply data-driven statistics.
Link to my project: https://github.com/SusannaSuntila/IODS-project
date()
## [1] "Sat Nov 26 14:29:20 2022"
I am excited and I only hope that I can keep up with the pace and content of this course, but it all sounds very interesting. I have taken some courses in R, but I have not used it professionally or for that long. I want to become even more familiar in using R and I want to deepen my skills in visualizing, sharing, wrangling, testing and analyzing data. I heard about this course in my previous course for research skills.
date()
## [1] "Sat Nov 26 14:29:20 2022"
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.5
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.4.1
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## Warning: package 'purrr' was built under R version 4.2.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
# the data-set that I created
learning2014 <- read.csv(file.path(".", "data", "learning2014.csv"))
# Look at the dimensions of the data
dim(learning2014)
## [1] 166 7
# Look at the structure of the data
str(learning2014)
## 'data.frame': 166 obs. of 7 variables:
## $ gender : chr "F" "M" "F" "M" ...
## $ age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: num 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ points : int 25 12 24 10 22 21 21 31 24 26 ...
There are 166 rows (observations) and 7 columns (variables) in this data. The gender variable is a character, the other variables are numbers or integers.
The original data is from a questionnaire that tracked different learning approaches and achievements on an introductory statistical course at university level. Gender and age are self-explanatory. The points variable tells how many exam points the person had. The attitude variable is a sum of 10 questions related to student’s attitude towards statistics, in a 1-5 scale. The remaining variables deep, stra and surf are combination variables that have been combined from multiple questions with the same dimension. The variables are averages of the selected questions concerning deep, surface and strategic learning.
library(ggplot2)
# set the theme white
theme_set(theme_bw())
# plot of student's attitude and points
ggplot(learning2014, aes(x = attitude, y = points, col = gender)) +
geom_point() +
geom_smooth(method = "lm") +
scale_colour_manual(values = c("lightpink2", "maroon")) +
labs(title = "Student's attitude compared to exam points",
subtitle = "learning2014 data")
## `geom_smooth()` using formula 'y ~ x'
library(GGally)
## Warning: package 'GGally' was built under R version 4.2.2
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(ggplot2)
# draw a scatter plot matrix of the variables in learning2014.
# [-1] excludes the first column (gender)
pairs(learning2014[-1])
# add color by gender
gender_col <- c("lightpink2", "maroon")[unclass(factor(learning2014$gender))]
pairs(learning2014[-1], pch = 19, col = gender_col)
# create a more advanced plot matrix with ggpairs()
ggpairs(learning2014, mapping = aes(), lower = list(combo = wrap("facethist", bins = 20)))
# with color by gender
ggpairs(learning2014, mapping = aes(col = gender, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20))) +
scale_colour_manual(values = c("lightpink2", "maroon")) +
scale_fill_manual(values = c("lightpink2", "maroon"))
First, there are clearly more women who have participated in this questionnaire than men. When looking at the distribution of the variables, it is clear that age is concentrated below thirty, with a long tail towards right. With the age variable, the skew is larger than with any other variable. There is also more variation between men compared to women. Interestingly with the attitude variable, there is more variation with women compared to men. Men also have slightly higher mean, and their distribution is more skewed. The deep variable has a longer tail on the left side, so the observations are more concentrated on the higher side. The difference between genders is very small in this case. The stra variable has a wide distribution. The surf variable is more concentrated among women compared to men. The point variable has a thick tale among lower scores and the distribution is more concentrated among higher scores.
The largest and most significant correlation can be found with attitude and points. It is positive, so better attitude is related to better points in the exam. The lowest correlation is between deep learning and points from the exam which is interesting as well. Surf variable is also significantly correlating with attitude and deep.
library(GGally)
library(ggplot2)
# draw a plot of the linear relation of exam points and attitude
qplot(attitude, points, data = learning2014) + geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'
# create a regression model with multiple explanatory variables
model1 <- lm(points ~ attitude + gender + stra, data = learning2014)
# print out a summary of the model
summary(model1)
##
## Call:
## lm(formula = points ~ attitude + gender + stra, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.7179 -3.3285 0.5343 3.7412 10.9007
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.9798 2.4030 3.737 0.000258 ***
## attitude 3.5100 0.5956 5.893 2.13e-08 ***
## genderM -0.2236 0.9248 -0.242 0.809231
## stra 0.8911 0.5441 1.638 0.103419
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.304 on 162 degrees of freedom
## Multiple R-squared: 0.2051, Adjusted R-squared: 0.1904
## F-statistic: 13.93 on 3 and 162 DF, p-value: 3.982e-08
In this model the dependent variable is exam points as instructed. The three explanatory variables that I chose are attitude, gender and stra. In the summary of the model it can be seen that only the intercept and the variable attitude are statistically significant, so attitude is the only variable strongly associated with exam points in this particular model. The coefficient on attitude, 3.5, tells that one unit change in attitude is related to a 3.5 point increase in exam points, conditional on other variables of this model. The gender variable is a dummy variable that has the value 1 when the person’s gender is male, so the coefficient of the gender variable describes the difference between the two genders, and when the person is male, it lowers the exam points by -0.22, again conditional on the other variables of this model. The stra variable that is computed from strategic questions has a coefficient of 0.89, meaning that better strategic learning will increase the exam points. The F-test that all three coefficients would be zero has a very small p-value, so it can be rejected.
As the variables gender and stra were not statistically significant, I dropped the gender variable first in the next model.
# create a second regression model
model2 <- lm(points ~ attitude + stra, data = learning2014)
# print out a summary of the model
summary(model2)
##
## Call:
## lm(formula = points ~ attitude + stra, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.6436 -3.3113 0.5575 3.7928 10.9295
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.9729 2.3959 3.745 0.00025 ***
## attitude 3.4658 0.5652 6.132 6.31e-09 ***
## stra 0.9137 0.5345 1.709 0.08927 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.289 on 163 degrees of freedom
## Multiple R-squared: 0.2048, Adjusted R-squared: 0.1951
## F-statistic: 20.99 on 2 and 163 DF, p-value: 7.734e-09
The attitude variable is still statistically highly significant. The attitude variable is positively related to the exam points, with 3.5 increase with one unit change in attitude for the better conditional on the stra variable. The intercept of this model means that if attitude would be 0, exam points would be 11.6 according to this model. What is interesting, is that when gender variable is removed from the model, the stra variable becomes significant at 10 percent level. It has a coefficient 0.9, meaning that one unit increase in strategic learning is related to a 0.9 increase in exam points, conditional on the attitude variable.
The multiple R-squared is 0.20, which means that the two variables attitude and stra together explain 20 percent of the variation in exam points.
Compared to the last model, the F-statistic has risen as well, and the p-value is smaller.
# draw diagnostic plots using the plot() function. Choose the plots 1, 2 and 5
par(mfrow = c(2,2))
plot(model2,which = c(1,2,5))
The first plot that compares the residuals with the fitted values shows that the points are spread around quite randomly, except for a few outliers, so the model is pretty appropriate. The plot does not indicate non-linear trends or non-constant variance.
The Q-Q-plot shows that the residuals are somewhat normally distributed. However, the lower tail is more heavy, so the values are larger there than would be expected and the upper tail is lighter, so values are smaller there than expected.
The residuals vs leverage plot shows if there are any outliers that would be significant for the model. The Cook’s distance curved lines don’t show in this plot, so the outliers aren’t too disturbing and there aren’t any points that would have high residuals and too much leverage at the same time.
date()
## [1] "Sat Nov 26 14:29:50 2022"
library(tidyverse)
# read in the data that was created
alc <- read.csv(file.path(".", "data", "alc.csv"))
# print names of the variables
colnames(alc)
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "guardian" "traveltime" "studytime" "schoolsup"
## [16] "famsup" "activities" "nursery" "higher" "internet"
## [21] "romantic" "famrel" "freetime" "goout" "Dalc"
## [26] "Walc" "health" "failures" "paid" "absences"
## [31] "G1" "G2" "G3" "alc_use" "high_use"
This data set is constructed from a secondary school student questionnaire of Portuguese schools. As the names of the variables indicate, the questions cover topics such as school and studies, but have also social and demographic aspects. The alc data set is combined from two data sets dealing with the students’ performance in math and Portuguese language.
The binary variable address is an interesting one, I would predict that when living in urban areas high alcohol use would be more probable, as there are more options to use alcohol than for students living in rural areas.
Another variable that I want to look at is the studytime one (weekly study time, divided into 4 groups, where 4 is the highest amount spent in studying), to see if spending more time studying is negatively related to high alcohol use, as the student is spending more time with studies and is possibly more committed to school.
I would hypothesize also that goout variable, telling how much the student goes out with friends (1-5 scale where 5 is very high) is positively linked to high alcohol use, as alcohol is usually a socially related issue.
Lastly I wanted to include the more continuous variable absences, which measures the number of school absences (0-93). I would hypothesize that more absences are positively linked to high alcohol use, as drinking a lot might affect the student’s attendance at school.
First the address variable
library(ggplot2)
# set the background white
theme_set(theme_bw())
# bar plots of address variable
g1 <- ggplot(data = alc, aes(x = address, fill = high_use)) +
geom_bar() +
theme(legend.position = "none") +
scale_fill_manual(values = c("paleturquoise3", "paleturquoise4"))
g2 <- ggplot(data = alc, aes(x = address, fill = high_use)) +
geom_bar(position = "fill") +
ylab("proportion") +
scale_fill_manual(values = c("paleturquoise3", "paleturquoise4"))
# combine the two plots
library(patchwork)
g1 | g2
As can be expected, most of the students live in urban areas, and so more people have high use of alcohol levels in urban areas, but contrary to what I thought relatively more people have high use of alcohol in rural areas.
Next the studytime variable:
# studytime variable
g3 <- ggplot(data = alc, aes(x = studytime, fill = high_use)) +
geom_bar() +
theme(legend.position = "none") +
scale_fill_manual(values = c("plum3", "plum4"))
g4 <- ggplot(data = alc, aes(x = studytime, fill = high_use)) +
geom_bar(position = "fill") +
ylab("proportion") +
scale_fill_manual(values = c("plum3", "plum4"))
# combine the two plots
library(patchwork)
g3 | g4
As I thought, less studytime has higher proportion of students who have high use of alcohol. This is clear for both absolutely and proportionally.
Next the goout variable:
# the going out variable
g5 <- ggplot(data = alc, aes(x = goout, fill = high_use)) +
geom_bar() +
theme(legend.position = "none") +
scale_fill_manual(values = c("palevioletred3", "palevioletred4"))
g6 <- ggplot(data = alc, aes(x = goout, fill = high_use)) +
geom_bar(position = "fill") +
ylab("proportion") +
scale_fill_manual(values = c("palevioletred3", "palevioletred4"))
# combine the two plots
library(patchwork)
g5 | g6
Here the high use of alcohol increases with the amount of going out with friends, and the pattern is very clear.
And lastly the absences variable:
# absences
g7 <- ggplot(data = alc, aes(x = absences, fill = high_use)) +
geom_bar() +
theme(legend.position = "none") +
scale_fill_manual(values = c("pink3", "pink4"))
g8 <- ggplot(data = alc, aes(x = absences, fill = high_use)) +
geom_bar(position = "fill") +
scale_fill_manual(values = c("pink3", "pink4")) +
ylab("proportion")
# combine the two plots
library(patchwork)
g7 | g8
Here it is also clear that increase in absences is linked to an increase in high use of alcohol as well, but there are also some outliers with few students who have a very high amount of absences.
Looking at the absences variable closer with a box plot:
# box plot of absences with color by address
ggplot(data = alc, aes(x = high_use, y = absences, col = address)) +
geom_boxplot() +
scale_color_manual(values = c("violetred", "violetred4"))
The median of absences is the same whether student is form rural or urban area when there is no high use of alcohol, but the median for absences is slightly higher for students from rural areas if they belong to the high use of alcohol group.
Next few cross-tabulations. First I wanted to add a table that includes address, the mean of going out and of course high use of alcohol.
# table of address, high use of alcohol and the mean of going out
alc %>% group_by(address, high_use) %>% summarise(count = n(), mean_goout = mean(goout))
## `summarise()` has grouped output by 'address'. You can override using the
## `.groups` argument.
## # A tibble: 4 × 4
## # Groups: address [2]
## address high_use count mean_goout
## <chr> <lgl> <int> <dbl>
## 1 R FALSE 50 2.6
## 2 R TRUE 31 3.52
## 3 U FALSE 209 2.91
## 4 U TRUE 80 3.81
The mean of going out increases quite a lot when the alcohol use is high for both rural and urban area students. The mean of going out in both cases of alcohol use is higher for students who live in urban areas compared to rural area students, as would be expected.
Another cross-tabulation of the mean of studytime and mean of absences with regards to alcohol use:
# table of address, high use of alcohol and the mean of going out
alc %>% group_by(high_use) %>% summarise(count = n(), mean_studytime = mean(studytime), mean_absences = mean(absences))
## # A tibble: 2 × 4
## high_use count mean_studytime mean_absences
## <lgl> <int> <dbl> <dbl>
## 1 FALSE 259 2.16 3.71
## 2 TRUE 111 1.77 6.38
Those students who have high use of alcohol also have lower mean of studytime and higher mean of absences compared to those who do not have high use of alcohol.
# simple model with only the absences as an explanatory variable
model <- glm(high_use ~ absences, data = alc, family = "binomial")
#summary of the simple model
summary(model)
##
## Call:
## glm(formula = high_use ~ absences, family = "binomial", data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.3598 -0.8209 -0.7318 1.2658 1.7419
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.26945 0.16094 -7.888 3.08e-15 ***
## absences 0.08867 0.02317 3.827 0.00013 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 434.52 on 368 degrees of freedom
## AIC: 438.52
##
## Number of Fisher Scoring iterations: 4
####
# model with all the chosen variables: absences, studytime, goout and address
model1 <- glm(high_use ~ absences + studytime + goout + address, data = alc, family = "binomial")
# print out a summary of the model
summary(model1)
##
## Call:
## glm(formula = high_use ~ absences + studytime + goout + address,
## family = "binomial", data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9982 -0.7761 -0.4950 0.8461 2.3232
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.92997 0.56532 -3.414 0.000640 ***
## absences 0.06687 0.02221 3.011 0.002600 **
## studytime -0.59099 0.16862 -3.505 0.000457 ***
## goout 0.75392 0.12061 6.251 4.09e-10 ***
## addressU -0.73038 0.30408 -2.402 0.016308 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 371.44 on 365 degrees of freedom
## AIC: 381.44
##
## Number of Fisher Scoring iterations: 4
# print out the coefficients of the model
coef(model1)
## (Intercept) absences studytime goout addressU
## -1.92997318 0.06687033 -0.59099384 0.75391575 -0.73038272
# compute odds ratios (OR)
OR <- coef(model1) %>% exp
# compute confidence intervals (CI)
CI <- confint(model1) %>% exp()
## Waiting for profiling to be done...
# print out the odds ratios with their confidence intervals
cbind(OR, CI)
## OR 2.5 % 97.5 %
## (Intercept) 0.1451521 0.0466309 0.4305313
## absences 1.0691568 1.0250338 1.1197230
## studytime 0.5537766 0.3936409 0.7639234
## goout 2.1253059 1.6882118 2.7118282
## addressU 0.4817246 0.2648416 0.8757487
I looked at first a simple model, where absences was the only explanatory variable, and the AIC was 438.52. After adding the three other variables, the AIC had decreased to 381.44, so adding the other variables has improved the model.
When looking at the summary of the fitted model with all the chosen variables, all the coefficients are statistically significant at the 5 % level. As the exploration of the variables earlier implied, the relationship is positive with absences and goout, so increase in these variables will increase the odds of the student having high use of alcohol. Studytime and address on the other hand decrease the odds of having high use of alcohol.
Looking at the coefficients more closer, absences has an odds ratio of 1.069, meaning that a unit increase in absences increase the odds of the student having high use of alcohol by 6.9% keeping other variables constant, and the confidence interval is between 2.5% and 11.97%. This isn’t as large an effect, as I might have expected.
Studytime has a coefficient that is less than one, 0.55, so when studytime increases by one more unit, it decreases the odds of the student having high use of alcohol by 45% when other variables are constant, and the 95% CI is 61% - 24% decrease in odds. Studytime has an opposite effect when increasing, compared to absences.
The goout variable has an odds ratio of 2.125, meaning that a unit increase in goout increases the odds of the student having high use of alcohol by 112.5% again adjusting for the other variables in the model. The 95% CI is 68.8% - 171%. So going out has quite a large effect, which was apparent when plotting it in the beginning.
The address variable has an odds ratio of 0.48, meaning that if the student lives in an urban area, it decreases the odds of the student having high use of alcohol by 52%, when adjusting for the other variables. The negative effect has a 95% CI between 74% - 13%. This is quite surprising as in the beginning I wasn’t even sure of the direction of the effect.
First a 2x2 cross tabulation of predictions versus the actual values:
# fit the model
model1 <- glm(high_use ~ absences + studytime + goout + address, data = alc, family = "binomial")
# predictions
library(dplyr)
alc <- mutate(alc, probability = predict(model1, type = "response"))
alc <- mutate(alc, prediction = probability > 0.5)
# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction)
## prediction
## high_use FALSE TRUE
## FALSE 237 22
## TRUE 63 48
# initialize a plot of 'high_use' versus 'probability' in 'alc'
ggplot(alc, aes(x = probability, y = high_use, col = prediction)) +
geom_point() +
scale_color_manual(values = c("skyblue2", "skyblue4"))
This model predicts high use incorrectly for 22 students, and low use incorrectly for 63 students.
Next the the total proportion of inaccurately classified individuals, the training error.
# define a loss function (mean prediction error)
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
# call loss_func to compute the average number of wrong predictions in the (training) data
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2297297
The training error is 0.2297, so on average ~23% of the predictions are wrong. The training error is below 0.5, so it is better than randomly guessing.
#K-fold cross-validation, K=10
library(boot)
## Warning: package 'boot' was built under R version 4.2.2
cv <- cv.glm(data = alc, cost = loss_func, glmfit = model1, K = 10)
# average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.2459459
One set of the 10-fold cross-validation gives the prediction error of 0.2378, so 23.78% of the predictions are wrong. This is slightly smaller than the model in the exercise set had, 0.26, so the model that I have used has a bit better test set performance.
# access the MASS package
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:patchwork':
##
## area
## The following object is masked from 'package:dplyr':
##
## select
# load the data
data("Boston")
# explore the dataset
dim(Boston)
## [1] 506 14
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
There are 506 observations and 14 variables in this data set from MASS-package. The data is formed from housing values in suburbs of Boston. For example the variable crim tells the per capita crime rate by town.
library(GGally)
library(ggplot2)
library(reshape2)
## Warning: package 'reshape2' was built under R version 4.2.2
##
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
##
## smiths
library(tidyr)
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.2.2
## corrplot 0.92 loaded
#summary of the variables
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
# draw a scatter plot matrix of the variables
pairs(Boston)
# the distribution of all the variables
ggplot(data = melt(Boston), aes(x = value)) +
stat_density() +
facet_wrap(~variable, scales = "free")
## No id variables; using all as measure variables
# correlations between the variables
cor_matrix <- cor(Boston) %>%
round(., digits = 2)
corrplot.mixed(cor_matrix)
Summaries of the variables show that the scale varies a lot for the variables.
When looking at the distributions of the variables, only the variable rm (average number of rooms per dwelling) is close to normal distribution and medv (median value of owner-occupied homes in 1000s) is somewhat normal. The variable of interest, crim, has most of the observations at the left tail as does the variable zn (proportion of residential land zoned for lots over 25,000 sq.ft). The reverse is true for the variable black (1000(Bk−0.63)^2 where Bk is the proportion of blacks by town). There are also few variables that have two peaks at their distribution; indus (proportion of non-retail business acres per town), rad (index of accessibility to radial highways) and tax (full-value property-tax rate per $10,000). Rest of the variables are skewed to left or right.
Finally, when looking at the correlation plot, crim correlates (positively) the most with rad. Largest correlations overall can be found with nox and dis (-0.77) and with rad and tax(0.91). It seems like indus, nox, dis, rad and tax correlate the most with other variables. On the other hand the variable chas (Charles River dummy variable) doesn’t really correlate with any variable.
First standardize the data set
# use the scale function
boston_scaled <- as.data.frame(scale(Boston))
boston_scaled$crim <- as.numeric(boston_scaled$crim)
# summaries of the scaled data
summary(boston_scaled)
## crim zn indus chas
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563 Min. :-0.2723
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723
## Median :-0.390280 Median :-0.48724 Median :-0.2109 Median :-0.2723
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202 Max. : 3.6648
## nox rm age dis
## Min. :-1.4644 Min. :-3.8764 Min. :-2.3331 Min. :-1.2658
## 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049
## Median :-0.1441 Median :-0.1084 Median : 0.3171 Median :-0.2790
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617
## Max. : 2.7296 Max. : 3.5515 Max. : 1.1164 Max. : 3.9566
## rad tax ptratio black
## Min. :-0.9819 Min. :-1.3127 Min. :-2.7047 Min. :-3.9033
## 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049
## Median :-0.5225 Median :-0.4642 Median : 0.2746 Median : 0.3808
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332
## Max. : 1.6596 Max. : 1.7964 Max. : 1.6372 Max. : 0.4406
## lstat medv
## Min. :-1.5296 Min. :-1.9063
## 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 3.5453 Max. : 2.9865
The initial data had very wide range of values for each variable and the sclales varied a lot depending on the variable, so standardizing has normalized the range of the values.
Next the creation of a factor variable form the crim variable:
# create a quantile vector of crim
bins <- quantile(boston_scaled$crim)
# create a categorical variable 'crime'
crime <- cut(boston_scaled$crim, breaks = bins, labels = c("low", "med_low", "med_high", "high") ,include.lowest = TRUE)
# look at the table of the new factor crime
table(crime)
## crime
## low med_low med_high high
## 127 126 126 127
# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)
# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)
And finally the creation of train and test sets:
# Divide the dataset to train and test sets, so that 80% of the data belongs to the train set
ind <- sample(nrow(boston_scaled), size = nrow(boston_scaled) * 0.8)
train <- boston_scaled[ind,]
test <- boston_scaled[-ind,]
# linear discriminant analysis
lda.fit <- lda(crime ~ ., data = train)
# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2524752 0.2400990 0.2425743 0.2648515
##
## Group means:
## zn indus chas nox rm age
## low 1.04036289 -0.9434023 -0.07933396 -0.8806782 0.4140304 -0.8542608
## med_low -0.07504504 -0.2654090 -0.02879709 -0.5842920 -0.1016162 -0.3890089
## med_high -0.38923530 0.1876736 0.16959035 0.3024727 0.1195070 0.4139652
## high -0.48724019 1.0170108 -0.08835242 1.0468362 -0.3513018 0.7870990
## dis rad tax ptratio black lstat
## low 0.9487836 -0.6902506 -0.7179538 -0.51291617 0.3715615 -0.75413412
## med_low 0.3592130 -0.5544520 -0.4635403 -0.04136557 0.3494352 -0.18409163
## med_high -0.3506345 -0.4088096 -0.3135772 -0.20334344 0.1163236 0.02863549
## high -0.8411626 1.6392096 1.5148289 0.78203563 -0.7633519 0.89350604
## medv
## low 0.49200670
## med_low 0.03432029
## med_high 0.13977889
## high -0.74465768
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.11559173 0.83282968 -0.85485840
## indus 0.03204765 -0.36890224 0.30977336
## chas -0.07889343 -0.09281943 -0.01285010
## nox 0.41831722 -0.51902352 -1.39132063
## rm -0.10262130 -0.13304780 -0.12880127
## age 0.23793715 -0.20623494 -0.39047974
## dis -0.06199268 -0.24984016 0.01223446
## rad 3.22925853 0.90919774 -0.01821483
## tax -0.06172658 -0.01474090 0.59260171
## ptratio 0.12689567 0.01358643 -0.15334430
## black -0.12982312 0.00669120 0.17791210
## lstat 0.20568303 -0.30390424 0.37193799
## medv 0.14973403 -0.36361153 -0.12748793
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9531 0.0355 0.0114
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "navy", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
# target classes as numeric
classes <- as.numeric(train$crime)
# plot the lda results
plot(lda.fit, col = classes, pch = classes, dimen = 2)
lda.arrows(lda.fit, myscale = 2)
# save the crime categories from the test set
# then remove the categorical crime variable from the test dataset
correct_classes <- test$crime
test <- dplyr::select(test, -crime)
Next predict the classes with the LDA model on the test data and cross tabulate the results with the crime categories from the test set.
# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)
# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 12 12 1 0
## med_low 3 19 7 0
## med_high 0 4 23 1
## high 0 0 0 20
The LDA model has predicted almost all the observations belonging to the high class correctly. On the lower end, there are few observations that have been predicted to the med_low instead of the correct low class. There is more variability in the predictions for the medium classes. The prediction for med_low has wrongly predicted for low class almost as much observations. So the LDA model has more difficulties in predicting the observations belonging in the middle than on the tails.
Reload the Boston dataset and standardize the dataset:
library(MASS)
data("Boston")
# use the scale function
boston_scaled2 <- as.data.frame(scale(Boston))
boston_scaled2$crim <- as.numeric(boston_scaled2$crim)
Calculate the distances between the observations
# euclidean distance matrix
dist_eu <- dist(boston_scaled2)
# look at the summary of the distances
summary(dist_eu)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4625 4.8241 4.9111 6.1863 14.3970
# manhattan distance matrix
dist_man <- dist(boston_scaled2, method = "manhattan")
# look at the summary of the distances
summary(dist_man)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2662 8.4832 12.6090 13.5488 17.7568 48.8618
Run k-means algorithm on the dataset. Investigate what is the optimal number of clusters and run the algorithm again:
set.seed(17)
# k-means clustering
km <- kmeans(boston_scaled2, centers = 3)
# plot the Boston dataset with clusters
pairs(boston_scaled2, col = km$cluster)
# determine the number of clusters
k_max <- 10
# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled2, k)$tot.withinss})
# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')
# optimal number of clusters seems to be 2, as there is a large drop at that point.
# k-means clustering with two clusters
km <- kmeans(boston_scaled2, centers = 2)
# plot the Boston dataset with clusters
pairs(boston_scaled2, col = c("steelblue3", "sienna3")[km$cluster])
Looking at the pairs function it appears that one of the clusters has always absorbed the observations on one end or direction, so that two clusters looks quite appropriate when comparing the plots of all the variables against each other.
library(MASS)
data("Boston")
set.seed(19)
# use the scale function to standardize the data
boston_scaled2 <- as.data.frame(scale(Boston))
boston_scaled2$crim <- as.numeric(boston_scaled2$crim)
# k-means clustering with 4 clusters
km <- kmeans(boston_scaled2, centers = 4)
# linear discriminant analysis
lda.fit2 <- lda(km$cluster ~ ., data = boston_scaled2)
# print the lda.fit object
lda.fit2
## Call:
## lda(km$cluster ~ ., data = boston_scaled2)
##
## Prior probabilities of groups:
## 1 2 3 4
## 0.39920949 0.31818182 0.05731225 0.22529644
##
## Group means:
## crim zn indus chas nox rm
## 1 -0.3554389 -0.4039269 -0.004726028 0.11748284 0.009509773 -0.2448635
## 2 -0.4071151 0.9395565 -0.951448942 -0.12560484 -0.934654735 0.6775631
## 3 3.0022987 -0.4872402 1.014994623 -0.27232907 1.059334474 -1.3064650
## 4 0.4410309 -0.4872402 1.093886784 0.03849464 1.033664373 -0.1906821
## age dis rad tax ptratio black lstat
## 1 0.3085558 -0.225942 -0.5764965 -0.5036322 -0.09996773 0.2544955 0.08303063
## 2 -1.0699238 1.023927 -0.5966711 -0.6791796 -0.53145201 0.3578780 -0.86641188
## 3 0.9805356 -1.048472 1.6596029 1.5294129 0.80577843 -1.1906614 1.87087595
## 4 0.7148590 -0.779002 1.4419986 1.4625319 0.72271650 -0.6534850 0.60056774
## medv
## 1 -0.09324226
## 2 0.72802972
## 3 -1.31020021
## 4 -0.52966703
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## crim 0.35322461 0.307758951 -1.58997833
## zn 0.09849662 -0.330566750 -0.14199716
## indus 0.27720773 -0.031523290 0.36923953
## chas 0.03859298 0.112548411 0.06821859
## nox 0.07110504 -0.230199232 0.17319460
## rm -0.27410931 -0.568297722 0.31638241
## age 0.16778369 0.815075738 0.30710945
## dis -0.28454054 -0.606949418 -0.07519390
## rad 1.75298697 -0.745321060 0.47791025
## tax 1.03317493 -0.661045609 0.17536769
## ptratio 0.16021535 -0.003247889 0.05098139
## black -0.19237795 0.050874618 0.06252285
## lstat 0.27297507 0.032086428 -0.60051563
## medv 0.09553734 -0.204714316 -0.54447145
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.8386 0.0934 0.0680
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "navy", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
# target classes as numeric
classes <- as.numeric(km$cluster)
# plot the lda results
plot(lda.fit2, col = classes, pch = classes, dimen = 2)
lda.arrows(lda.fit2, myscale = 3)
There are 4 distinct clusters that are somewhat separate, but the 4th cluster has some of the observations in the middle of the plot, not clearly belonging to any cluster. Clusters 1 and 2 are close to each other as are clusters 3 and 4.
It appears that rad, tax, age, dis and rm have the highest influence in separating the clusters. Looking at the arrows, tax and rad seem to be explaining more for cluster 4, and rm and dis more for cluster 2 for example.
# run the given code
model_predictors <- dplyr::select(train, -crime)
# check the dimensions
dim(model_predictors)
## [1] 404 13
dim(lda.fit$scaling)
## [1] 13 3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)
library(plotly)
## Warning: package 'plotly' was built under R version 4.2.2
##
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
##
## select
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
#first plot
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers')
# target classes as numeric
classes <- as.numeric(train$crime)
# color by classes
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = classes)
# k-means clustering
km <- kmeans(model_predictors, centers = 4)
# target classes as numeric
classes2 <- as.numeric(km$cluster)
# color by km clusters
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = classes2)
All the plots have the same observations with most of the observations in a bigger groups and then a separate smaller group. The second plot that has the colors determined by crime classes has the separate smaller group belonging to one crime class. The third plot with the clusters has also the same smaller group in one cluster, but the lines between the clusters change a bit compared to the observations belonging to the crime classes.